ABSTRACT 


Enhanced Super-Resolution Generative Adversarial Network 
(ESRGAN) is a perceptual-driven approach for single image 
super-resolution that is able to produce photorealistic images. 
Despite the visual quality of these generated images, there 
is still room for improvement. We therefore extend the model 
to further improve the perceptual quality of the images. 
We design a network architecture with a novel 
basic block to replace the one used by the original ESRGAN. 
Moreover, we introduce noise inputs to the generator net- 
work in order to exploit stochastic variation. The resulting 
images present more realistic textures. The code is available at 
https://github.com/ncarraz/ESRGANplus 


Index Terms— Super-resolution, Generative adversarial 
network 


1. INTRODUCTION 


Super-resolution (SR) is the task of generating a high- 
resolution (HR) image from one or more low-resolution (LR) images. When 
only one LR image is used, the task is commonly called Single Image 
Super-Resolution (SISR). The objective of this task used 
to be the minimization of the mean squared error (MSE) 
between the generated image and the original one, which 
amounts to maximizing the peak signal-to-noise ratio (PSNR), 
a standard measure for SISR. However, PSNR-oriented 
approaches do not generate perceptually good images [1]. 
Perceptual-oriented methods were then proposed. Super- 
Resolution Generative Adversarial Network (SRGAN) 
uses both a perceptual loss and generative adversarial 
networks (GANs) [4] to produce images residing in the 
manifold of natural images. Enhanced Super-Resolution Generative 
Adversarial Network (ESRGAN) improves SRGAN 
by introducing an architecture composed of Residual-in- 
Residual Dense Blocks (RRDB) without Batch Normalization 
(BN) [6] layers. In addition, ESRGAN uses a relativistic average GAN 
(RaGAN) discriminator and computes the perceptual loss on features 
taken before activation. 
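
The link between MSE and PSNR mentioned above can be made concrete with a small sketch. It assumes 8-bit images (peak value 255) stored as flat lists of pixel values; the names `mse` and `psnr` and the toy pixel values are illustrative and are not taken from the paper's code.

```python
import math

def mse(reference, estimate):
    """Mean squared error between two equally sized pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

def psnr(reference, estimate, peak=255.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE); infinite for identical images."""
    error = mse(reference, estimate)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)

hr = [52, 55, 61, 59, 79, 61, 76, 61]      # toy "ground truth" pixels (made up)
close = [53, 55, 60, 59, 78, 61, 76, 62]   # small reconstruction error
far = [60, 45, 70, 50, 90, 50, 85, 50]     # larger reconstruction error

# PSNR decreases monotonically with MSE, so the lower-MSE image scores higher.
print(mse(hr, close) < mse(hr, far), psnr(hr, close) > psnr(hr, far))
```

Because PSNR depends on the images only through the MSE, a model trained to minimize MSE is implicitly trained to maximize PSNR, which is why such models can score well numerically while still looking over-smoothed.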

We aim to further improve the perceptual quality of the 
images generated by ESRGAN. First, we propose a new block 
called Residual-in-Residual Dense Residual Block (RRDRB) 
which has higher capacity than ESRGAN’s RRDB block. 


Second, we introduce noise inputs in the network, as in [8], in 
order to benefit from stochastic variation. 


2. RELATED WORK 


The main approaches to SISR can be divided into three dis- 
tinct categories: interpolation-based methods, reconstruction- 
based methods and learning-based methods [9]. Approaches 
based on deep learning have further surpassed the two former 
methods as well as simple learning-based methods. 


The very first deep learning-based approach, proposed by 
Dong et al. [11], is SRCNN. It makes use of convolu- 
tional neural networks in an end-to-end manner. Though the 
network is shallow, it outperformed previous techniques on 
the SISR task. Kim et al. introduce a deeper 
model called VDSR. With a similar performance, DRCN 
exploits deep recursive networks by combining intermediary 
results. SRResNet [1] and DRRN make use of residual 
units. EDSR and MDSR, its variant for multiple scale 
factors, are the state-of-the-art methods for PSNR-based 
super-resolution. Residual dense networks were used in SR- 
DenseNet and Memnet [16]. 


To focus more on the visual quality of generated 
images, a perceptual loss that tracks perceptual quality more 
closely than pixel-wise error was proposed. SRGAN, which is based 
on GANs, uses this perceptual loss along with an adversarial loss 
to produce photo-realistic images. These images are visually more convincing 
despite having lower scores on standard quantitative measures 
such as PSNR and structural similarity (SSIM). EnhanceNet 
is also based on GANs but uses a different architecture. 
ESRGAN, as its name implies, enhances SRGAN. It introduced 
a new block with a higher capacity named RRDB. In addition, 
BN layers were removed, and residual scaling and smaller 
initialization were used to facilitate training a very deep network. 
The discriminator uses relativistic average GAN, which 
learns to evaluate “whether one image is more realistic than the 
other” rather than “whether one image is real or fake”. 
Furthermore, in the perceptual loss, the VGG features are taken 
before activation rather than after as in SRGAN. There is still 
a gap between ground-truth images and images generated by 
ESRGAN. The present work aims to further close this gap. 
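
The difference between a standard and a relativistic average discriminator described above can be sketched as follows. This is a minimal illustration, assuming the usual relativistic average formulation (sigmoid of a critic output minus the mean critic output of the opposite class); the function names and the logit values are hypothetical and do not come from the ESRGAN code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def standard_discriminator(logit):
    """Standard GAN: probability that a single image is real."""
    return sigmoid(logit)

def relativistic_average(logit, other_logits):
    """RaGAN: probability that an image is more realistic than the
    average image of the opposite kind (real vs. generated)."""
    mean_other = sum(other_logits) / len(other_logits)
    return sigmoid(logit - mean_other)

real_logits = [2.0, 1.5, 2.5]   # hypothetical critic outputs for real images
fake_logits = [1.0, 0.5, 1.5]   # hypothetical critic outputs for generated images

# Real images are scored against the average fake, and vice versa.
d_real = [relativistic_average(c, fake_logits) for c in real_logits]
d_fake = [relativistic_average(c, real_logits) for c in fake_logits]
print(d_real, d_fake)
```

Note that with these made-up logits a standard discriminator would call every generated image "real" (all logits are positive), whereas the relativistic scores for the generated images fall below 0.5 because they are judged relative to the real ones.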


3. METHOD 


3.1. Network architecture 


ESRGAN’s basic block makes the network easier to 
train and gives it a very high capacity. The overall architecture 
of ESRGAN is maintained, as depicted in Figure 1, except for 
the Dense block, which is replaced by our new block. 


Fig. 1: The basic block used in ESRGAN called Residual in 
Residual Dense Block (RRDB). 


The novel block we propose results in greater capacity. 
RRDB has a residual-in-residual structure with Dense blocks 
in the main path. We add an additional level of residual 
learning inside the Dense blocks, as presented in Figure 2, to 
augment the network capacity without increasing its complexity. 
Specifically, a residual is added every two layers in each Dense 
block. The visual quality of the images generated using the 
new block is substantially superior to that obtained with the simple Dense 
block. As described in [19], ResNet enables feature re-use 
while DenseNet enables the discovery of new features. The new 
architecture therefore benefits from both feature exploitation and 
exploration, resulting in images of superior perceptual quality. 
We name the model using this new architecture ESRGAN+. 
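
One way to write down the difference between the two blocks is given below. It is a sketch under stated assumptions rather than a specification: $F_i$ denotes the $i$-th convolution together with its activation, $[\cdot]$ is channel-wise concatenation, the Dense block is assumed to have five layers, and the placement of the skip connections (from $x_0$ to $x_2$ and from $x_2$ to $x_4$) is one plausible reading of "a residual every two layers", not a wiring confirmed by the text.

```latex
% Dense block: each layer sees all previous feature maps
x_i = F_i\big([x_0, x_1, \dots, x_{i-1}]\big), \qquad i = 1, \dots, 5

% Dense block with an extra residual every two layers (assumed wiring)
x_1 = F_1\big([x_0]\big)
x_2 = F_2\big([x_0, x_1]\big) + x_0
x_3 = F_3\big([x_0, x_1, x_2]\big)
x_4 = F_4\big([x_0, x_1, x_2, x_3]\big) + x_2
x_5 = F_5\big([x_0, x_1, x_2, x_3, x_4]\big)
```

The added terms are plain element-wise sums, so they introduce no learnable parameters of their own. If the summed tensors had different channel counts, a projection would be needed before the addition; whether the actual model requires one is not stated here.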


3.2. Noise inputs 


Adding noise to the generator was recently used in human 
face generation [8], which also relies heavily on GANs. However, 
it had not previously been applied to super-resolution. To obtain 
stochastic detail, noise inputs are introduced in the generator’s 
architecture. Gaussian noise is added to the output of each 
residual dense block along with learned per-feature scaling 
factors, as illustrated in Figure 3. 


Stochastic variation randomizes only certain local aspects 
of the generated images without changing our global perception 
of the images [8]. The effects of the noise inputs are very 
localized, leaving intact the general structure and the higher- 
level information of the images. Because noise is supplied directly, 
the network does not need to generate spatially varying 
pseudorandom numbers itself when they are required. Consequently, 
the network capacity that would have been spent on that task can be 
used to produce finer details in the high-level aspects. The model using both 
the new block and the noise inputs is called nESRGAN+. 
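
The noise injection described above can be sketched in plain Python. This is an illustration only: feature maps are nested lists indexed as [channel][row][column], a single noise map is shared by all channels (an assumption borrowed from the way noise is injected in face generation), and the function name `add_scaled_noise` and the scaling values are hypothetical stand-ins for parameters that would be learned during training.

```python
import random

def add_scaled_noise(features, scales, rng=random):
    """Add Gaussian noise to a feature tensor given as [channel][row][col].

    One noise value is drawn per spatial position and shared by all
    channels; scales[c] plays the role of the learned per-feature factor.
    """
    if len(features) != len(scales):
        raise ValueError("need one scaling factor per channel")
    height, width = len(features[0]), len(features[0][0])
    noise = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    return [
        [[features[c][i][j] + scales[c] * noise[i][j] for j in range(width)]
         for i in range(height)]
        for c in range(len(features))
    ]

features = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
            [[5.0, 6.0], [7.0, 8.0]]]   # channel 1
scales = [0.0, 0.1]                     # hypothetical "learned" values

noisy = add_scaled_noise(features, scales, random.Random(0))
print(noisy[0] == features[0])          # channel 0 has scale 0 and is unchanged
```

A scaling factor of zero switches the noise off for a channel, so the network can learn per feature how much stochastic detail it wants.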


(a) Dense block 


(b) Residual Dense block 


Fig. 2: Top: Dense block is the main path used in ESRGAN’s 
RRDB. Bottom: Residuals are added every two layers in the 
Dense block. 


Fig. 3: Gaussian noise is added after each residual along with 
a learned scaling-factor. 


4. EXPERIMENTS 


4.1. Data 


The training set is DIV2K [20], a dataset of 2K-resolution 
images suited to the task of SR. The DIV2K dataset 
originally contains only 800 images. As in ESRGAN, 
data augmentation is performed through random horizontal 
flips and rotations. The benchmark datasets used for evaluation 
are BSD100 [21], Urban100 [22], OST300 [23], Set5 
[24], Set14 and the PIRM datasets [26]. 
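
The augmentation described above can be illustrated with a short sketch on images stored as lists of rows. Restricting rotations to multiples of 90 degrees is an assumption made here so that no interpolation is needed; the function names are illustrative.

```python
import random

def horizontal_flip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]

def rotate90(image):
    """Rotate an image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image, rng=random):
    """Apply a random horizontal flip, then a random multiple of 90 degrees."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    for _ in range(rng.randrange(4)):
        image = rotate90(image)
    return image

patch = [[1, 2],
         [3, 4]]
print(horizontal_flip(patch), rotate90(patch))
```

In a super-resolution pipeline the same flip and rotation would have to be applied to the LR patch and to its HR counterpart so that the pair stays aligned.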


4.2. Training details and parameters 


The LR images are obtained by downsampling the HR images 
using a bicubic kernel with a scaling factor of ×4. We maintain 
all the training parameters of the original ESRGAN. We crop 
128 × 128 HR sub-images. The size of the mini-batch is 16. 
Fig. 4: Comparison between the qualitative results of the main perceptual-driven models and ESRGAN+ using images from 
Set14. PSNR (value on the left) and perceptual index (value on the right) are used for the evaluation. 


Table 1: Quantitative evaluation of our models with other perceptual-driven methods. The best and second best results are 
highlighted and underlined, respectively. We evaluate using the perceptual index (value on the right) but PSNR (value on the 
left) is also given for reference purposes. 


Dataset           EnhanceNet   ESRGAN       ESRGAN+ (ours)   nESRGAN+ (ours) 
Validation PIRM   25.06/2.68   25.17/2.55   24/2.38          24.32/2.36 
Test PIRM         24.94/2.72   25.03/2.43   23.80/2.31       24.15/2.29 
Urban100          23.54/3.47   24.36/3.77   23.28/3.55       23.22/3.55 
OST300            24.37/2.82   24.64/2.49   23.84/2.46       23.80/2.49 


A PSNR-oriented pre-trained model is used to initialize the 
generator. The loss function remains unchanged with $\lambda = 
5 \times 10^{-3}$ and $\eta = 1 \times 10^{-2}$. The learning rate is set to 
$1 \times 10^{-4}$ and halved at [50k, 100k, 200k, 300k] iterations. 
The model is optimized using Adam with $\beta_1 = 0.9$ and 
$\beta_2 = 0.999$. The trained model is the one with the 23-block 
generator. The implementation is done with PyTorch and the 
training with NVIDIA Tesla K80 GPUs. 
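
The learning-rate schedule stated above can be written as a small function. This is a sketch: the function name is illustrative, and applying the halving from the milestone iteration onward (rather than one step later) is an assumption.

```python
def learning_rate(iteration, base_lr=1e-4,
                  milestones=(50_000, 100_000, 200_000, 300_000), factor=0.5):
    """Learning rate after halving at every milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * factor ** passed

for it in (0, 50_000, 100_000, 200_000, 300_000):
    print(it, learning_rate(it))
```

After the last milestone the rate is the initial value divided by 16, since four halvings have been applied.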


4.3. Results 


We compare our two models with other perceptual-driven 
approaches on the PIRM datasets (see Table 1). PSNR is measured 
on the luminance channel of the YCbCr color space. 
The perceptual index is the one used in the PIRM-SR Challenge 
[26]. It is based on Ma’s score and NIQE 
and equals $\frac{1}{2}((10 - \mathrm{Ma}) + \mathrm{NIQE})$. Higher is better 
for PSNR, whereas lower is better for the 
perceptual index. Both of our models obtain a perceptual index at least 
as good as ESRGAN’s on every dataset in Table 1. We see that nESRGAN+ 
has a better perceptual score on the PIRM datasets. 
This highlights the benefits of using the noise inputs in the 
generator network. However, the benefit of noise injection 
does not generalize to all datasets: adding noise 
does not always result in better perceptual quality. This is 
the case for categories of images which do not fully exploit 
stochastic variation, such as images of buildings in Urban100 
and OST300. Future work will focus on getting the most out 
of the Gaussian noise. 
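
The perceptual index used in this section is simple to compute once Ma's score and NIQE are available. The sketch below only implements the combination formula given in the text; the input values are made up for illustration and are not results from this paper.

```python
def perceptual_index(ma_score, niqe):
    """PI = 0.5 * ((10 - Ma) + NIQE); lower means better perceptual quality."""
    return 0.5 * ((10.0 - ma_score) + niqe)

# Hypothetical scores for two reconstructions of the same image.
print(perceptual_index(ma_score=8.0, niqe=3.0))   # 2.5
print(perceptual_index(ma_score=9.0, niqe=2.0))   # 1.5
```

The index falls when Ma's score rises or when NIQE falls, which is why lower values are better.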

A qualitative comparison is made in Figure 4 between our 
models and other PSNR-oriented and perceptual-driven methods 
such as SRCNN, EnhanceNet, SRGAN and ESRGAN, using 
images from the Set14 dataset. It can be observed that the 
images reconstructed by our models present more detailed 
structures and are less distinguishable from the ground-truth 
images than those of the other methods. Most of the 
original textures, such as the boy’s complexion, are preserved. 


5. CONCLUSION 


We have proposed ESRGAN+ and nESRGAN+, which outperform 
other approaches as far as perceptual quality is concerned. 
A new basic block has been introduced to further increase 
the capacity of the network. Moreover, noise inputs 
are added to benefit from stochastic variation. All these improvements 
have contributed to the generation of images with 
more natural textures as well as greater sharpness and detail. 